1 Dataset used:

Lending Club loan data - loans.csv
Traffic sign image data - knn_traffic_signs.csv
Donation data - donors.csv
Brett’s location data - locations.csv

2 Four Classification Algorithms

This beginner-level introduction to machine learning covers four of the most common classification algorithms. You will come away with a basic understanding of how each algorithm approaches a learning task, as well as learn the R functions needed to apply these tools to your own work.

3 1: k-Nearest Neighbors (kNN)

Because the kNN algorithm literally ‘learns by example’, it is a natural starting point for understanding supervised machine learning. This chapter introduces classification while working through the application of kNN to road sign recognition for self-driving vehicles.

4 2: Naive Bayes

Naive Bayes uses principles from the field of statistics to make predictions. This chapter introduces the basics of Bayesian methods while exploring how to apply these techniques to iPhone-like destination suggestions.

5 3: Logistic Regression

Logistic regression involves fitting a curve to numeric data to make predictions about binary events. Arguably one of the most widely used machine learning methods, this chapter will provide an overview of the technique while illustrating how to apply it to fundraising data.

6 4: Classification Trees

Classification trees use flowchart-like structures to make decisions. Because humans can readily understand these tree structures, classification trees are useful when transparency is needed, such as in loan approval. We’ll use the Lending Club dataset to simulate this scenario.

Instructor: Brett Lantz, author of the book Machine Learning with R

7 Classification with Nearest Neighbors

8 Classification tasks for driverless cars

Images of stop signs (mostly red), pedestrian/walk signs (mostly green), and speed signs (mostly bluish grey with a number in black).

* The image here illustrates the dataset.
* You probably already see some similarities; the machine can see them too.
* A nearest neighbor classifier takes advantage of the fact that signs that look alike should be similar to, or “nearby,” other signs of the same type.
* For example, if the car observes a sign similar to those in the group of stop signs, it will probably need to stop.
* So how does a nearest neighbor learner decide whether two signs are similar? It does so by literally measuring the distance between them.
* That is not to say it measures distance in physical space - a stop sign in New York is the same as a stop sign in Los Angeles.
* Instead, it imagines the properties of the signs as coordinates in what is called a feature space.
* Consider, for instance, a sign’s color. By imagining color as a three-dimensional feature space measuring levels of red, green, and blue, signs of similar color naturally sit close to one another.
* Once the feature space has been constructed in this way, you can measure distance using a formula like the one below.

dist(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)
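This distance formula can be written as a short helper in base R (`euclid()` is a hypothetical name used here for illustration; the `knn()` function used later computes these distances internally):

```r
# Euclidean distance between two points in the feature space
# (hypothetical helper for illustration; knn() does this internally)
euclid <- function(p, q) {
  sqrt(sum((p - q)^2))
}

# two signs described by (red, green, blue) color coordinates
euclid(c(204, 227, 220), c(196, 59, 51))
```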

9 Recognizing a road sign with kNN

After several trips with a human behind the wheel, it is time for the self-driving car to attempt the test course alone.

As it begins to drive away, its camera captures the following image:

Stop Sign

Can you apply a kNN classifier to help the car recognize this sign?

install.packages("dplyr")
install.packages("readr")
install.packages("ggplot2")
install.packages("purrr")
install.packages("class")
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 3.5.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
library(purrr)
## Warning: package 'purrr' was built under R version 3.5.3
library(class)
## Warning: package 'class' was built under R version 3.5.3
signs = read_csv("C:/shobha/R/DataCamp/dataFiles/CSV-files/knn_traffic_signs.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   sample = col_character(),
##   sign_type = col_character()
## )
## See spec(...) for full column specifications.
glimpse(signs)
## Observations: 206
## Variables: 51
## $ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ sample    <chr> "train", "train", "train", "train", "train", "train"...
## $ sign_type <chr> "pedestrian", "pedestrian", "pedestrian", "pedestria...
## $ r1        <dbl> 155, 142, 57, 22, 169, 75, 136, 118, 149, 13, 123, 1...
## $ g1        <dbl> 228, 217, 54, 35, 179, 67, 149, 105, 225, 34, 124, 1...
## $ b1        <dbl> 251, 242, 50, 41, 170, 60, 157, 69, 241, 28, 107, 13...
## $ r2        <dbl> 135, 166, 187, 171, 231, 131, 200, 244, 34, 5, 83, 3...
## $ g2        <dbl> 188, 204, 201, 178, 254, 89, 203, 245, 45, 21, 61, 4...
## $ b2        <dbl> 101, 44, 68, 26, 27, 53, 107, 67, 1, 11, 26, 37, 26,...
## $ r3        <dbl> 156, 142, 51, 19, 97, 214, 150, 132, 155, 123, 116, ...
## $ g3        <dbl> 227, 217, 51, 27, 107, 144, 167, 123, 226, 154, 124,...
## $ b3        <dbl> 245, 242, 45, 29, 99, 75, 134, 12, 238, 140, 115, 12...
## $ r4        <dbl> 145, 147, 59, 19, 123, 156, 171, 138, 147, 21, 67, 4...
## $ g4        <dbl> 211, 219, 62, 27, 147, 169, 218, 123, 222, 46, 67, 5...
## $ b4        <dbl> 228, 242, 65, 29, 152, 190, 252, 85, 242, 41, 52, 49...
## $ r5        <dbl> 166, 164, 156, 42, 221, 67, 171, 254, 170, 36, 70, 1...
## $ g5        <dbl> 233, 228, 171, 37, 236, 50, 158, 254, 191, 60, 53, 1...
## $ b5        <dbl> 245, 229, 50, 3, 117, 36, 108, 92, 113, 26, 26, 141,...
## $ r6        <dbl> 212, 84, 254, 217, 205, 37, 157, 241, 26, 75, 26, 60...
## $ g6        <dbl> 254, 116, 255, 228, 225, 36, 186, 240, 37, 108, 26, ...
## $ b6        <dbl> 52, 17, 36, 19, 80, 42, 11, 108, 12, 44, 21, 18, 20,...
## $ r7        <dbl> 212, 217, 211, 221, 235, 44, 26, 254, 34, 13, 52, 9,...
## $ g7        <dbl> 254, 254, 226, 235, 254, 42, 35, 254, 45, 27, 45, 13...
## $ b7        <dbl> 11, 26, 70, 20, 60, 44, 10, 99, 19, 25, 27, 17, 20, ...
## $ r8        <dbl> 188, 155, 78, 181, 90, 192, 180, 108, 221, 133, 117,...
## $ g8        <dbl> 229, 203, 73, 183, 110, 131, 211, 106, 249, 163, 109...
## $ b8        <dbl> 117, 128, 64, 73, 9, 73, 236, 27, 184, 126, 83, 33, ...
## $ r9        <dbl> 170, 213, 220, 237, 216, 123, 129, 135, 226, 83, 110...
## $ g9        <dbl> 216, 253, 234, 234, 236, 74, 109, 123, 246, 125, 74,...
## $ b9        <dbl> 120, 51, 59, 44, 66, 22, 73, 40, 59, 19, 12, 12, 18,...
## $ r10       <dbl> 211, 217, 254, 251, 229, 36, 161, 254, 30, 13, 98, 2...
## $ g10       <dbl> 254, 255, 255, 254, 255, 34, 190, 254, 40, 27, 70, 1...
## $ b10       <dbl> 3, 21, 51, 2, 12, 37, 10, 115, 34, 25, 26, 11, 20, 2...
## $ r11       <dbl> 212, 217, 253, 235, 235, 44, 161, 254, 34, 9, 20, 28...
## $ g11       <dbl> 254, 255, 255, 243, 254, 42, 190, 254, 44, 23, 21, 2...
## $ b11       <dbl> 19, 21, 44, 12, 60, 44, 6, 99, 35, 18, 20, 19, 13, 1...
## $ r12       <dbl> 172, 158, 66, 19, 163, 197, 187, 138, 241, 85, 113, ...
## $ g12       <dbl> 235, 225, 68, 27, 168, 114, 215, 123, 255, 128, 76, ...
## $ b12       <dbl> 244, 237, 68, 29, 152, 21, 236, 85, 54, 21, 14, 12, ...
## $ r13       <dbl> 172, 164, 69, 20, 124, 171, 141, 118, 205, 83, 106, ...
## $ g13       <dbl> 235, 227, 65, 29, 117, 102, 142, 105, 229, 125, 69, ...
## $ b13       <dbl> 244, 237, 59, 34, 91, 26, 140, 75, 46, 19, 9, 12, 13...
## $ r14       <dbl> 172, 182, 76, 64, 188, 197, 189, 131, 226, 85, 102, ...
## $ g14       <dbl> 228, 228, 84, 61, 205, 114, 171, 124, 246, 128, 67, ...
## $ b14       <dbl> 235, 143, 22, 4, 78, 21, 140, 5, 59, 21, 6, 12, 13, ...
## $ r15       <dbl> 177, 171, 82, 211, 125, 123, 214, 106, 235, 85, 106,...
## $ g15       <dbl> 235, 228, 93, 222, 147, 74, 221, 94, 252, 128, 69, 4...
## $ b15       <dbl> 244, 196, 17, 78, 20, 22, 201, 53, 67, 21, 9, 11, 18...
## $ r16       <dbl> 22, 164, 58, 19, 160, 180, 188, 101, 237, 83, 43, 60...
## $ g16       <dbl> 52, 227, 60, 27, 183, 107, 211, 91, 254, 125, 29, 45...
## $ b16       <dbl> 53, 237, 60, 29, 187, 26, 227, 59, 53, 19, 11, 18, 1...

Remove the id and sample columns from the signs dataset.

#signs = signs %>% select(-id, -sample) 
#or
#signs = signs[, -c(1,2)]
#or
signs = signs[, -(1:2)]
names(signs)
##  [1] "sign_type" "r1"        "g1"        "b1"        "r2"       
##  [6] "g2"        "b2"        "r3"        "g3"        "b3"       
## [11] "r4"        "g4"        "b4"        "r5"        "g5"       
## [16] "b5"        "r6"        "g6"        "b6"        "r7"       
## [21] "g7"        "b7"        "r8"        "g8"        "b8"       
## [26] "r9"        "g9"        "b9"        "r10"       "g10"      
## [31] "b10"       "r11"       "g11"       "b11"       "r12"      
## [36] "g12"       "b12"       "r13"       "g13"       "b13"      
## [41] "r14"       "g14"       "b14"       "r15"       "g15"      
## [46] "b15"       "r16"       "g16"       "b16"
# create a test observation whose label will be predicted using kNN
next_sign = signs[206, -1]

glimpse(next_sign)
## Observations: 1
## Variables: 48
## $ r1  <dbl> 204
## $ g1  <dbl> 227
## $ b1  <dbl> 220
## $ r2  <dbl> 196
## $ g2  <dbl> 59
## $ b2  <dbl> 51
## $ r3  <dbl> 202
## $ g3  <dbl> 67
## $ b3  <dbl> 59
## $ r4  <dbl> 204
## $ g4  <dbl> 227
## $ b4  <dbl> 220
## $ r5  <dbl> 236
## $ g5  <dbl> 250
## $ b5  <dbl> 234
## $ r6  <dbl> 242
## $ g6  <dbl> 252
## $ b6  <dbl> 235
## $ r7  <dbl> 205
## $ g7  <dbl> 148
## $ b7  <dbl> 131
## $ r8  <dbl> 190
## $ g8  <dbl> 50
## $ b8  <dbl> 43
## $ r9  <dbl> 179
## $ g9  <dbl> 70
## $ b9  <dbl> 57
## $ r10 <dbl> 242
## $ g10 <dbl> 229
## $ b10 <dbl> 212
## $ r11 <dbl> 190
## $ g11 <dbl> 50
## $ b11 <dbl> 43
## $ r12 <dbl> 193
## $ g12 <dbl> 51
## $ b12 <dbl> 44
## $ r13 <dbl> 170
## $ g13 <dbl> 197
## $ b13 <dbl> 196
## $ r14 <dbl> 190
## $ g14 <dbl> 50
## $ b14 <dbl> 43
## $ r15 <dbl> 190
## $ g15 <dbl> 47
## $ b15 <dbl> 41
## $ r16 <dbl> 165
## $ g16 <dbl> 195
## $ b16 <dbl> 196

Create a vector of sign labels to use with kNN by extracting the column sign_type from signs.

Identify the next_sign using the knn() function.

Set the train argument equal to the signs data frame without the first column.

Set the test argument equal to the data frame next_sign.

Use the vector of labels you created as the cl argument.

# Load the 'class' package

# Create a vector of labels
sign_types <- signs$sign_type

str(sign_types)
##  chr [1:206] "pedestrian" "pedestrian" "pedestrian" "pedestrian" ...

# Classify the next sign observed
knn(train = signs[-1], test = next_sign, cl = sign_types)
## [1] stop
## Levels: pedestrian speed stop

Awesome! You’ve trained your first nearest neighbor classifier!

Thinking like kNN

With your help, the test car successfully identified the sign and stopped safely at the intersection.

How did the knn() function correctly classify the stop sign?

Possible Answers It learned that stop signs are red

The sign was in some way similar to another stop sign (answer)

Stop signs have eight sides

The other types of signs were less likely

Correct! kNN isn’t really learning anything; it simply looks for the most similar example.

10 Exploring the traffic sign dataset

To better understand how the knn() function was able to classify the stop sign, it may help to examine the training dataset it used.

Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue levels of each of the 16 center pixels are recorded, as illustrated here.

The result is a dataset that records the sign_type as well as 16 x 3 = 48 color properties of each sign.

Knn Stop Sign

blue part - red: 204, green: 227, blue: 220
red part - red: 193, green: 52, blue: 44

Use the str() function to examine the signs dataset.

Use table() to count the number of observations of each sign type by passing it the column containing the labels.

Run the provided aggregate() command to see whether the average red level might vary by sign type.

# Examine the structure of the signs dataset
str(signs)
## Classes 'tbl_df', 'tbl' and 'data.frame':    206 obs. of  49 variables:
##  $ sign_type: chr  "pedestrian" "pedestrian" "pedestrian" "pedestrian" ...
##  $ r1       : num  155 142 57 22 169 75 136 118 149 13 ...
##  $ g1       : num  228 217 54 35 179 67 149 105 225 34 ...
##  $ b1       : num  251 242 50 41 170 60 157 69 241 28 ...
##  $ r2       : num  135 166 187 171 231 131 200 244 34 5 ...
##  $ g2       : num  188 204 201 178 254 89 203 245 45 21 ...
##  $ b2       : num  101 44 68 26 27 53 107 67 1 11 ...
##  $ r3       : num  156 142 51 19 97 214 150 132 155 123 ...
##  $ g3       : num  227 217 51 27 107 144 167 123 226 154 ...
##  $ b3       : num  245 242 45 29 99 75 134 12 238 140 ...
##  $ r4       : num  145 147 59 19 123 156 171 138 147 21 ...
##  $ g4       : num  211 219 62 27 147 169 218 123 222 46 ...
##  $ b4       : num  228 242 65 29 152 190 252 85 242 41 ...
##  $ r5       : num  166 164 156 42 221 67 171 254 170 36 ...
##  $ g5       : num  233 228 171 37 236 50 158 254 191 60 ...
##  $ b5       : num  245 229 50 3 117 36 108 92 113 26 ...
##  $ r6       : num  212 84 254 217 205 37 157 241 26 75 ...
##  $ g6       : num  254 116 255 228 225 36 186 240 37 108 ...
##  $ b6       : num  52 17 36 19 80 42 11 108 12 44 ...
##  $ r7       : num  212 217 211 221 235 44 26 254 34 13 ...
##  $ g7       : num  254 254 226 235 254 42 35 254 45 27 ...
##  $ b7       : num  11 26 70 20 60 44 10 99 19 25 ...
##  $ r8       : num  188 155 78 181 90 192 180 108 221 133 ...
##  $ g8       : num  229 203 73 183 110 131 211 106 249 163 ...
##  $ b8       : num  117 128 64 73 9 73 236 27 184 126 ...
##  $ r9       : num  170 213 220 237 216 123 129 135 226 83 ...
##  $ g9       : num  216 253 234 234 236 74 109 123 246 125 ...
##  $ b9       : num  120 51 59 44 66 22 73 40 59 19 ...
##  $ r10      : num  211 217 254 251 229 36 161 254 30 13 ...
##  $ g10      : num  254 255 255 254 255 34 190 254 40 27 ...
##  $ b10      : num  3 21 51 2 12 37 10 115 34 25 ...
##  $ r11      : num  212 217 253 235 235 44 161 254 34 9 ...
##  $ g11      : num  254 255 255 243 254 42 190 254 44 23 ...
##  $ b11      : num  19 21 44 12 60 44 6 99 35 18 ...
##  $ r12      : num  172 158 66 19 163 197 187 138 241 85 ...
##  $ g12      : num  235 225 68 27 168 114 215 123 255 128 ...
##  $ b12      : num  244 237 68 29 152 21 236 85 54 21 ...
##  $ r13      : num  172 164 69 20 124 171 141 118 205 83 ...
##  $ g13      : num  235 227 65 29 117 102 142 105 229 125 ...
##  $ b13      : num  244 237 59 34 91 26 140 75 46 19 ...
##  $ r14      : num  172 182 76 64 188 197 189 131 226 85 ...
##  $ g14      : num  228 228 84 61 205 114 171 124 246 128 ...
##  $ b14      : num  235 143 22 4 78 21 140 5 59 21 ...
##  $ r15      : num  177 171 82 211 125 123 214 106 235 85 ...
##  $ g15      : num  235 228 93 222 147 74 221 94 252 128 ...
##  $ b15      : num  244 196 17 78 20 22 201 53 67 21 ...
##  $ r16      : num  22 164 58 19 160 180 188 101 237 83 ...
##  $ g16      : num  52 227 60 27 183 107 211 91 254 125 ...
##  $ b16      : num  53 237 60 29 187 26 227 59 53 19 ...
# Count the number of signs of each type
table(signs$sign_type)
## 
## pedestrian      speed       stop 
##         65         70         71
# Check r10's average red level by sign type 
# average of red colour in each of the sign types
aggregate(r10 ~ sign_type, data = signs, mean)
##    sign_type       r10
## 1 pedestrian 108.78462
## 2      speed  83.08571
## 3       stop 142.50704
# the same can be achieved using dplyr verbs
signs %>% 
  group_by(sign_type) %>%
  summarise(mean(r10))
## # A tibble: 3 x 2
##   sign_type  `mean(r10)`
##   <chr>            <dbl>
## 1 pedestrian       109. 
## 2 speed             83.1
## 3 stop             143.

Great work! As you might have expected, stop signs tend to have a higher average red value. This is how kNN identifies similar signs.

11 Classifying a collection of road signs

Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course.

The test course includes 59 additional road signs divided into three types:

Speed Sign

Ped Sign

Stop Sign

At the conclusion of the trial, you are asked to measure the car’s overall performance at recognizing these signs.

The class package and the dataset signs are already loaded in your workspace. So is the dataframe test_signs, which holds a set of observations you’ll test your model on.

12 Classify the test_signs data using knn():

Set train equal to the observations in signs without labels.

Use test_signs for the test argument, again without labels.

For the cl argument, use the vector of labels provided for you.

Use table() to explore the classifier’s performance at identifying the three sign types.

Create the vector signs_actual by extracting the labels from test_signs.

Pass the vector of predictions and the vector of actual signs to table() to cross tabulate them.

Compute the overall accuracy of the kNN learner using the mean() function.

test_index = c(8,13,14,19,20,22,29,30,36,44,45,46,
                       47,50,52,53,57,62,63,66,67,69,74,75,
                       78,82,84,100,101,103,110,113,117,123,124,130,
                       131,132,133,135,137,140,142,143,148,151,154,156,
                       157,164,174,175,181,183,192,193,201,203,205)

test_signs = signs[test_index,]
dim(test_signs)
## [1] 59 49

train_signs = signs[-test_index,]
dim(train_signs)
## [1] 147  49

dim(signs)
## [1] 206  49

# Use kNN to identify the test road signs
sign_types = train_signs$sign_type
signs_pred = knn(train = train_signs[, -1], 
                 test = test_signs[-1], 
                 cl = sign_types)


# Create a confusion matrix of the actual versus predicted values
signs_actual = test_signs$sign_type
table(signs_actual,signs_pred)
##             signs_pred
## signs_actual pedestrian speed stop
##   pedestrian         19     0    0
##   speed               2    17    2
##   stop                0     0   19
# Compute the accuracy
mean(signs_actual ==  signs_pred)
## [1] 0.9322034

Fantastic! That self-driving car is really coming along! The confusion matrix lets you look for patterns in the classifier’s errors.

13 What about the ‘k’ in kNN

KNN Neighbors

Bigger ‘k’ is not always better

KNN Impact Small

KNN Impact Large

14 Understanding the impact of ‘k’

There is a complex relationship between k and classification accuracy. Bigger is not always better.

Which of these is a valid reason for keeping k as small as possible (but no smaller)?

ANSWER THE QUESTION

Possible Answers A smaller k requires less processing power

A smaller k reduces the impact of noisy data

A smaller k minimizes the chance of a tie vote

A smaller k may utilize more subtle patterns (answer)

Yes! With smaller neighborhoods, kNN can identify more subtle patterns in the data.

Testing other ‘k’ values

By default, the knn() function in the class package uses only the single nearest neighbor.

Setting a k parameter allows the algorithm to consider additional nearby neighbors. This enlarges the collection of neighbors which will vote on the predicted class.
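The voting step itself is simple to sketch in base R (`majority_vote()` is a hypothetical helper, not part of the class package; `knn()` performs this tally internally and breaks ties at random):

```r
# among the k nearest neighbors' labels, the most frequent one wins
majority_vote <- function(neighbor_labels) {
  counts <- table(neighbor_labels)
  names(counts)[which.max(counts)]
}

# with k = 5, three 'stop' votes beat one 'speed' and one 'pedestrian'
majority_vote(c("stop", "stop", "speed", "stop", "pedestrian"))
## [1] "stop"
```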

Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.

# cl is the vector of labels
# sign prediction with k = 1
k_1 = knn(train = train_signs[-1], 
          test = test_signs[-1], 
          cl = train_signs$sign_type,
          k = 1)

k_1
##  [1] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
##  [7] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [13] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [19] pedestrian stop       pedestrian speed      speed      speed     
## [25] speed      speed      speed      stop       pedestrian speed     
## [31] speed      speed      speed      speed      speed      speed     
## [37] speed      speed      speed      speed      stop       stop      
## [43] stop       stop       stop       stop       stop       stop      
## [49] stop       stop       stop       stop       stop       stop      
## [55] stop       stop       stop       stop       stop      
## Levels: pedestrian speed stop
# accuracy
mean(signs_actual == k_1)
## [1] 0.9322034
# sign prediction with k = 7
k_7 = knn(train = train_signs[-1], 
          test = test_signs[-1], 
          cl = train_signs$sign_type,
          k = 7)

k_7
##  [1] pedestrian pedestrian pedestrian stop       pedestrian pedestrian
##  [7] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [13] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [19] pedestrian speed      speed      speed      speed      speed     
## [25] speed      speed      speed      stop       speed      speed     
## [31] speed      speed      speed      speed      speed      speed     
## [37] speed      speed      speed      speed      stop       stop      
## [43] stop       stop       stop       stop       stop       stop      
## [49] stop       stop       stop       stop       stop       stop      
## [55] stop       stop       stop       stop       stop      
## Levels: pedestrian speed stop
# accuracy
mean(signs_actual == k_7)
## [1] 0.9661017
#sign prediction with k = 15
k_15 = knn(train = train_signs[-1], 
          test = test_signs[-1], 
          cl = train_signs$sign_type,
          k = 15)

k_15
##  [1] pedestrian stop       pedestrian stop       stop       pedestrian
##  [7] pedestrian pedestrian pedestrian pedestrian pedestrian pedestrian
## [13] pedestrian speed      pedestrian pedestrian pedestrian pedestrian
## [19] stop       speed      speed      speed      speed      speed     
## [25] speed      speed      speed      stop       speed      speed     
## [31] speed      speed      speed      speed      speed      speed     
## [37] speed      speed      speed      speed      stop       stop      
## [43] stop       stop       stop       stop       stop       stop      
## [49] stop       stop       stop       stop       stop       stop      
## [55] stop       stop       stop       stop       stop      
## Levels: pedestrian speed stop
# accuracy
mean(signs_actual == k_15)
## [1] 0.8983051

You’re a kNN pro! Which value of k gave the highest accuracy? k = 7

Seeing how the neighbors voted

When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated.

For example, knowing more about the voters’ confidence in the classification could allow an autonomous vehicle to use caution if there is any chance at all that a stop sign is ahead.

In this exercise, you will learn how to obtain the voting results from the knn() function.

Build a kNN model with the prob = TRUE parameter to compute the vote proportions. Set k = 7.

Use the attr() function to obtain the vote proportions for the predicted class. These are stored in the attribute “prob”.

Examine the first several vote outcomes and percentages using the head() function to see how the confidence varies from sign to sign.

# Use the prob parameter to get the proportion of votes for the winning class
sign_pred_k_7 = knn(train = train_signs[-1],
          test = test_signs[-1],
          cl = train_signs$sign_type,
          k = 7,
          prob = TRUE)



# Get the "prob" attribute from the predicted classes
sign_prob = attr(sign_pred_k_7, "prob")


# Examine the first several predictions
head(sign_pred_k_7)
## [1] pedestrian pedestrian pedestrian stop       pedestrian pedestrian
## Levels: pedestrian speed stop
"
[1] pedestrian pedestrian pedestrian stop       pedestrian pedestrian
Levels: pedestrian speed stop
"
## [1] "\n[1] pedestrian pedestrian pedestrian stop       pedestrian pedestrian\nLevels: pedestrian speed stop\n"
# Examine the proportion of votes for the winning class
head(sign_prob)
## [1] 0.5714286 0.5714286 0.8571429 0.5714286 0.8571429 0.5714286

Wow! Awesome job! Now you can get an idea of how certain your kNN learner is about its classifications.

15 Data preparation for kNN

16 kNN assumes numeric data

speed limit sign (rectangle) - rectangle = 1, diamond = 0
pedestrian crossing (diamond) - rectangle = 0, diamond = 1
stop sign (octagon) - rectangle = 0, diamond = 0
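This kind of dummy coding takes only a line or two of base R. The sketch below uses a hypothetical `shape` vector (not a column of the signs dataset) to show the idea:

```r
# dummy-code a categorical shape feature into numeric 0/1 columns
shape <- c("rectangle", "diamond", "octagon")
shape_dummies <- data.frame(rectangle = as.numeric(shape == "rectangle"),
                            diamond   = as.numeric(shape == "diamond"))
shape_dummies
```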

17 kNN benefits from normalized data

KNN Normalize Before

KNN Normalize After

18 Normalizing data in R

# define a min-max normalize() function
normalize = function(x){
    return((x - min(x)) /  (max(x) - min(x)))
}

# normalized version of r1
summary(normalize(signs$r1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1935  0.3528  0.4046  0.6129  1.0000
# un-normalized version of r1
summary(signs$r1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    51.0    90.5   103.3   155.0   251.0
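Applied across the whole dataset, the same function rescales every color column at once. This is a minimal sketch on a small stand-in data frame; on the real data the `lapply()` call would run over `signs[-1]`:

```r
# min-max normalize every column of a data frame
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

df <- data.frame(r1 = c(3, 90, 251), b1 = c(0, 100, 200))
df_norm <- as.data.frame(lapply(df, normalize))
df_norm   # every column now runs from 0 to 1
```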

19 Why normalize data?

Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization. What is the purpose of this step?

ANSWER THE QUESTION

Possible Answers To ensure all data elements may contribute equal shares to distance. (answer)

To help the kNN algorithm converge on a solution faster.

To convert all of the data elements to numbers.

To redistribute the data as a normal bell curve.

Yes! Rescaling reduces the influence of extreme values on kNN’s distance function.

20 Understanding Bayesian methods

21 Estimating probability

NN Locations

The probability of A is denoted P(A)

P(work) = 23 / 40 = 57.5%
P(store) = 4 / 40 = 10.0%

Joint probability and independent events

* When events occur together, they have a joint probability.
* Their intersection can be depicted using a Venn diagram, like those shown here.

NB Venn

The joint probability of events A and B is denoted P(A and B)

P(work and evening) = 1%
P(work and afternoon) = 20%

Conditional probability and dependent events

NB Venn

The conditional probability of event A given event B is denoted P(A|B) - it is their joint probability divided by the probability of B:

P(A|B) = P(A and B)/P(B)

P(work | evening) = 1/25 = 4%
P(work | afternoon) = 20/25 = 80%
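The same arithmetic checks out directly in R. The counts below are assumed from the 1/25 ratio above, out of 100 total observations:

```r
# P(work | evening) = P(work and evening) / P(evening)
p_work_and_evening <- 1 / 100    # joint probability: 1%
p_evening          <- 25 / 100   # assumed: 25 evening observations of 100
p_work_and_evening / p_evening   # 0.04, i.e. 4%
```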

22 Making predictions with Naive Bayes

# building a Naive Bayes model
library(naivebayes)

#m = naive_bayes(location ~ time_of_day, data = location_history)

# making predictions with Naive Bayes
#future_location = predict(m, future_conditions)

# Create data frame as below from locations.csv
#str(locations)
library(naivebayes)

locations_org = read_csv("C:/shobha/R/DataCamp/dataFiles/CSV-files/locations.csv")
## Parsed with column specification:
## cols(
##   month = col_double(),
##   day = col_double(),
##   weekday = col_character(),
##   daytype = col_character(),
##   hour = col_double(),
##   hourtype = col_character(),
##   location = col_character()
## )
dim(locations_org)
## [1] 2184    7
glimpse(locations_org)
## Observations: 2,184
## Variables: 7
## $ month    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day      <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,...
## $ weekday  <chr> "wednesday", "wednesday", "wednesday", "wednesday", "...
## $ daytype  <chr> "weekday", "weekday", "weekday", "weekday", "weekday"...
## $ hour     <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ hourtype <chr> "night", "night", "night", "night", "night", "night",...
## $ location <chr> "home", "home", "home", "home", "home", "home", "home...
length(unique(locations_org$day))
## [1] 31
length(unique(locations_org$month))
## [1] 4
length(unique(locations_org$weekday))
## [1] 7
length(unique(locations_org$daytype))
## [1] 2
length(unique(locations_org$hour))
## [1] 24
length(unique(locations_org$hourtype))
## [1] 4
length(unique(locations_org$location))
## [1] 7
unique(locations_org$month)
## [1] 1 2 3 4
unique(locations_org$hourtype)
## [1] "night"     "morning"   "afternoon" "evening"
unique(locations_org$hour)
##  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
## [24] 23
head(locations_org)
## # A tibble: 6 x 7
##   month   day weekday   daytype  hour hourtype location
##   <dbl> <dbl> <chr>     <chr>   <dbl> <chr>    <chr>   
## 1     1     4 wednesday weekday     0 night    home    
## 2     1     4 wednesday weekday     1 night    home    
## 3     1     4 wednesday weekday     2 night    home    
## 4     1     4 wednesday weekday     3 night    home    
## 5     1     4 wednesday weekday     4 night    home    
## 6     1     4 wednesday weekday     5 night    home
locations = locations_org %>%
  select(daytype, hourtype, location)

str(locations)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2184 obs. of  3 variables:
##  $ daytype : chr  "weekday" "weekday" "weekday" "weekday" ...
##  $ hourtype: chr  "night" "night" "night" "night" ...
##  $ location: chr  "home" "home" "home" "home" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   month = col_double(),
##   ..   day = col_double(),
##   ..   weekday = col_character(),
##   ..   daytype = col_character(),
##   ..   hour = col_double(),
##   ..   hourtype = col_character(),
##   ..   location = col_character()
##   .. )
# convert all columns to factor
locations = locations %>%
  mutate_all(factor)

str(locations)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2184 obs. of  3 variables:
##  $ daytype : Factor w/ 2 levels "weekday","weekend": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hourtype: Factor w/ 4 levels "afternoon","evening",..: 4 4 4 4 4 4 3 3 3 3 ...
##  $ location: Factor w/ 7 levels "appointment",..: 3 3 3 3 3 3 3 3 3 4 ...
where9am = locations_org %>%
  filter(hour == 9) %>%
  select(daytype, location)

dim(where9am)
## [1] 91  2

23 Computing probabilities

The where9am data frame contains 91 days (thirteen weeks) worth of data in which Brett recorded his location at 9am each day as well as whether the daytype was a weekend or weekday.

Using the conditional probability formula below, you can compute the probability that Brett is working in the office, given that it is a weekday.

P(A|B) = P(A and B)/P(B)

Calculations like these are the basis of the Naive Bayes destination prediction model you’ll develop in later exercises.

head(where9am)
## # A tibble: 6 x 2
##   daytype location
##   <chr>   <chr>   
## 1 weekday office  
## 2 weekday office  
## 3 weekday office  
## 4 weekend home    
## 5 weekend home    
## 6 weekday campus

Find P(office) using nrow() and subset() to count rows in the dataset and save the result as p_A.

Find P(weekday), using nrow() and subset() again, and save the result as p_B.

Use nrow() and subset() a final time to find P(office and weekday). Save the result as p_AB.

Compute P(office | weekday) and save the result as p_A_given_B.

Print the value of p_A_given_B

# Using dplyr
atOffice = where9am %>% filter(location == "office")
isWeekday = where9am %>% filter(daytype == "weekday")
atOffice_isWeekday = where9am %>% filter(location == "office", daytype == "weekday")

head(atOffice)
## # A tibble: 6 x 2
##   daytype location
##   <chr>   <chr>   
## 1 weekday office  
## 2 weekday office  
## 3 weekday office  
## 4 weekday office  
## 5 weekday office  
## 6 weekday office
head(isWeekday)
## # A tibble: 6 x 2
##   daytype location   
##   <chr>   <chr>      
## 1 weekday office     
## 2 weekday office     
## 3 weekday office     
## 4 weekday campus     
## 5 weekday home       
## 6 weekday appointment
head(atOffice_isWeekday)
## # A tibble: 6 x 2
##   daytype location
##   <chr>   <chr>   
## 1 weekday office  
## 2 weekday office  
## 3 weekday office  
## 4 weekday office  
## 5 weekday office  
## 6 weekday office
(p_A = nrow(atOffice)/nrow(where9am))
## [1] 0.4285714
(p_B = nrow(isWeekday)/nrow(where9am))
## [1] 0.7142857
(p_AB = nrow(atOffice_isWeekday)/nrow(where9am))
## [1] 0.4285714
(p_A_given_B = p_AB/p_B)
## [1] 0.6
# using subscript
p_A = nrow(where9am[which(where9am$location == "office"),])/nrow(where9am)
p_A
## [1] 0.4285714
p_B = nrow(where9am[which(where9am$daytype == "weekday"),])/nrow(where9am)
p_B
## [1] 0.7142857
p_AB = nrow(where9am[which( where9am$location == "office" & 
                              where9am$daytype == "weekday"),])/nrow(where9am)
p_AB
## [1] 0.4285714
(p_A_given_B = p_AB/p_B)
## [1] 0.6

Great work! In a lot of cases, calculating probabilities is as simple as counting.

24 Understanding dependent events

In the previous exercise, you found that there is a 60% chance Brett is in the office at 9am given that it is a weekday. On the other hand, if Brett is never in the office on a weekend, which of the following is/are true?

Possible Answers P(office and weekend) = 0.

P(office | weekend) = 0.

Brett’s location is dependent on the day of the week.

All of the above. (answer)

Correct! Because the events do not overlap, knowing that one occurred tells you much about the status of the other.

A simple Naive Bayes location model

The previous exercises showed that the probability that Brett is at work or at home at 9am is highly dependent on whether it is the weekend or a weekday.

To see this finding in action, use the where9am data frame to build a Naive Bayes model on the same data.

You can then use this model to predict the future: where does the model think that Brett will be at 9am on Thursday and at 9am on Saturday?

The dataframe where9am is available in your workspace. This dataset contains information about Brett’s location at 9am on different days.

Load the naivebayes package.

Use naive_bayes() with a formula like y ~ x to build a model of location as a function of daytype.

Forecast the Thursday 9am location using predict() with the thursday9am object as the newdata argument.

Do the same to predict the saturday9am location.

#install.packages("naivebayes")
library(naivebayes)

locmodel <- naive_bayes(location ~ daytype, data = where9am)

locmodel 
## ===================== Naive Bayes ===================== 
## Call: 
## naive_bayes.formula(formula = location ~ daytype, data = where9am)
## 
## A priori probabilities: 
## 
## appointment      campus        home      office 
##  0.01098901  0.10989011  0.45054945  0.42857143 
## 
## Tables: 
##          
## daytype   appointment    campus      home    office
##   weekday   1.0000000 1.0000000 0.3658537 1.0000000
##   weekend   0.0000000 0.0000000 0.6341463 0.0000000
names(where9am)
## [1] "daytype"  "location"
(thursday9am = data.frame(daytype = "weekday"))
##   daytype
## 1 weekday
saturday9am = data.frame(daytype = "weekend")
saturday9am
##   daytype
## 1 weekend
# making predictions with Naive Bayes
(predict(locmodel,  thursday9am))
## [1] office
## Levels: appointment campus home office
(predict(locmodel, saturday9am))
## [1] office
## Levels: appointment campus home office

Note: the Saturday prediction above is incorrect; the expected answer is home. Passing the data frame saturday9am appears to trip up predict() because its character daytype column is not matched against the factor levels the model saw during training. Passing the value directly gives the correct answer:

(predict(locmodel, "weekend"))
## [1] home
## Levels: appointment campus home office
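One way to sidestep this kind of mismatch is to construct newdata with the same factor levels the model was trained on. A minimal, self-contained sketch on hypothetical toy data (not Brett's actual log):

```r
library(naivebayes)

# Hypothetical toy data: weekdays are mostly 'office',
# weekends are always 'home'
train <- data.frame(
  daytype  = factor(c(rep("weekday", 10), rep("weekend", 4))),
  location = factor(c(rep("office", 9), "home", rep("home", 4)))
)
m <- naive_bayes(location ~ daytype, data = train)

# Give newdata the same factor levels the model saw during training
saturday <- data.frame(
  daytype = factor("weekend", levels = levels(train$daytype))
)
predict(m, saturday)  # "home", since every weekend observation was 'home'
```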

Awesome job! Not surprisingly, Brett is most likely at the office at 9am on a Thursday, but at home at the same time on a Saturday!

25 Examining ‘raw’ probabilities

The naivebayes package offers several ways to peek inside a Naive Bayes model.

Typing the name of the model object provides the a priori (overall) and conditional probabilities of each of the model’s predictors. If you were so inclined, you could use these to calculate posterior (predicted) probabilities by hand.

Alternatively, R will compute the posterior probabilities for you if the type = ‘prob’ parameter is supplied to the predict() function.

Using these methods, examine how the model’s predicted 9am location probability varies from day-to-day.

The model locmodel that you fit in the previous exercise is in your workspace.

Print the locmodel object to the console to view the computed a priori and conditional probabilities.

Use the predict() function similarly to the previous exercise, but with type = ‘prob’ to see the predicted probabilities for Thursday at 9am.

Compare these to the predicted probabilities for Saturday at 9am.

(predict(locmodel,  thursday9am, type = "prob"))
##      appointment    campus      home office
## [1,]  0.01538462 0.1538462 0.2307692    0.6
(predict(locmodel,  "weekend", type = "prob"))
##      appointment    campus      home    office
## [1,]  0.01098901 0.1098901 0.4505495 0.4285714

Note: the weekend probabilities above simply reproduce the a priori probabilities and are incorrect; the expected output is:

##      appointment campus home office
## [1,]           0      0    1      0

Fantastic! Did you notice the predicted probability of Brett being at the office on a Saturday is zero?

26 Understanding independence

Understanding the idea of event independence will become important as you learn more about how ‘naive’ Bayes got its name. Which of the following is true about independent events?

Possible Answers The events cannot occur at the same time.

A Venn diagram will always show no intersection.

Knowing the outcome of one event does not help predict the other. (answer)

At least one of the events is completely random.

Yes! One event is independent of another if knowing one doesn’t give you information about how likely the other is. For example, knowing if it’s raining in New York doesn’t help you predict the weather in San Francisco. The weather events in the two cities are independent of each other.

27 Understanding NB’s ‘naivety’

28 The challenge of multiple predictors

NB Multi Venn

NB Multi Venn

A ‘naive’ simplification NB Simplification.

29 An ‘infrequent’ problem

NB Infrequent.

NB Infrequent.

The Laplace correction (pronounced ‘la-PLAHS’)

NB Laplace.

NB Laplace.

30 Who are you calling naive?

The Naive Bayes algorithm got its name because it makes a ‘naive’ assumption about event independence.

What is the purpose of making this assumption?

Possible Answers Independent events can never have a joint probability of zero.

The joint probability calculation is simpler for independent events. (answer)

Conditional probability is undefined for dependent events.

Dependent events cannot be used to make predictions.

Yes! The joint probability of independent events can be computed much more simply by multiplying their individual probabilities.
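This simplification can be shown in a couple of lines; the probabilities below are made-up numbers for illustration only:

```r
# Conditional probabilities of two predictors given one class
# (hypothetical values)
p_weekday_given_office <- 0.98
p_morning_given_office <- 0.35

# The 'naive' joint probability treats the predictors as independent
# and simply multiplies their individual probabilities
p_joint_naive <- p_weekday_given_office * p_morning_given_office
p_joint_naive
# [1] 0.343
```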

31 A more sophisticated location model

The locations dataset records Brett’s location every hour for 13 weeks. Each hour, the tracking information includes the daytype (weekend or weekday) as well as the hourtype (morning, afternoon, evening, or night).

Using this data, build a more sophisticated model to see how Brett’s predicted location not only varies by the day of week but also by the time of day.

The dataset locations is already loaded in your workspace.

Use the R formula interface to build a model where location depends on both daytype and hourtype. Recall that the function naive_bayes() takes 2 arguments: formula and data.

Predict Brett’s location on a weekday afternoon using the dataframe weekday_afternoon and the predict() function.

Do the same for a weekday_evening

names(locations)
## [1] "daytype"  "hourtype" "location"
locmodel = naive_bayes(location ~ daytype + hourtype ,data = locations )

weekday_afternoon = data.frame(daytype = "weekday",
                               hourtype = "afternoon",
                               location = "office")

weekday_evening = data.frame(daytype = "weekday",
                               hourtype = "evening",
                               location = "home")

predict(locmodel, weekday_afternoon)
## [1] office
## Levels: appointment campus home office restaurant store theater
predict(locmodel, weekday_evening)
## [1] office
## Levels: appointment campus home office restaurant store theater

Note: the weekday_evening prediction above is incorrect; the expected output is:

## [1] home
## Levels: appointment campus home office restaurant store theater

Great job! Your Naive Bayes model forecasts that Brett will be at the office on a weekday afternoon and at home in the evening.

weekday_afternoon
##    daytype hourtype  location
## 13 weekday afternoon office

weekday_evening
##    daytype hourtype location
## 19 weekday evening  home

32 Preparing for unforeseen circumstances

While Brett was tracking his location over 13 weeks, he never went into the office during the weekend. Consequently, the joint probability of P(office and weekend) = 0.

Explore how this impacts the predicted probability that Brett may go to work on the weekend in the future. Additionally, you can see how using the Laplace correction will allow a small chance for these types of unforeseen circumstances.

The model locmodel is already in your workspace, along with the dataframe weekend_afternoon.

Use the locmodel to output predicted probabilities for a weekend afternoon by using the predict() function. Remember to set the type argument.

Create a new naive Bayes model with the Laplace smoothing parameter set to 1. You can do this by setting the laplace argument in your call to naive_bayes(). Save this as locmodel2.

See how the new predicted probabilities compare by using the predict() function on your new model.

#rm(weekend_afternoon)

weekend_afternoon = locations %>%
  filter(daytype == "weekend", hourtype == "afternoon", location == "home") %>%
  head(1)

weekend_afternoon
## # A tibble: 1 x 3
##   daytype hourtype  location
##   <fct>   <fct>     <fct>   
## 1 weekend afternoon home
str(weekend_afternoon)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  3 variables:
##  $ daytype : Factor w/ 2 levels "weekday","weekend": 2
##  $ hourtype: Factor w/ 4 levels "afternoon","evening",..: 1
##  $ location: Factor w/ 7 levels "appointment",..: 3
predict(locmodel, weekend_afternoon, type = "prob")
##      appointment campus      home office restaurant      store theater
## [1,]  0.02472535      0 0.8472217      0  0.1115693 0.01648357       0
locmodel2 = naive_bayes(location ~ daytype + hourtype, data = locations, laplace = 1)

predict(locmodel2, weekend_afternoon, type = "prob")
##      appointment      campus      home      office restaurant      store
## [1,]  0.01107985 0.005752078 0.8527053 0.008023444  0.1032598 0.01608175
##          theater
## [1,] 0.003097769

Fantastic work! Adding the Laplace correction allows for the small chance that Brett might go to the office on the weekend in the future.
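The correction itself is simple arithmetic. A sketch by hand, using hypothetical counts and the common convention of adding the Laplace value to each cell (the exact denominator convention may differ between implementations):

```r
office_weekend <- 0    # times Brett was observed at the office on a weekend
n_weekend      <- 26   # total weekend observations (hypothetical count)
k              <- 2    # number of daytype levels (weekday, weekend)
laplace        <- 1

p_raw     <- office_weekend / n_weekend                         # exactly 0
p_laplace <- (office_weekend + laplace) / (n_weekend + laplace * k)
c(raw = p_raw, laplace = p_laplace)   # 0 versus roughly 0.036
```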

Understanding the Laplace correction

By default, the naive_bayes() function in the naivebayes package does not use the Laplace correction. What is the risk of leaving this parameter unset?

Possible Answers Some potential outcomes may be predicted to be impossible. (answer)

The algorithm may have a divide by zero error.

Naive Bayes will ignore features with zero values.

The model may not estimate probabilities for some cases.

Correct! The small probability added to every outcome ensures that they are all possible even if never previously observed.

Applying Naive Bayes to other problems

How Naive Bayes uses data

NB Data.

NB Data.

Binning numeric data for Naive Bayes * binning is a simple technique for creating categories from numeric data * the idea is to divide a range of numbers into a series of sets called bins

NB Binning 1

NB Binning 1
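In R, binning is typically done with cut(). A small sketch with hypothetical ages and cut points:

```r
# Hypothetical ages and cut points; intervals are right-closed by default,
# so (0,17] is 'child', (17,64] is 'adult', (64,Inf] is 'senior'
age <- c(5, 17, 18, 42, 65, 80)
age_bin <- cut(age,
               breaks = c(0, 17, 64, Inf),
               labels = c("child", "adult", "senior"))
age_bin
# [1] child  child  adult  adult  senior senior
# Levels: child adult senior
```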

33 Preparing text data for Naive Bayes

NB BagOfWords

NB BagOfWords

34 Handling numeric predictors

Numeric data is often binned before it is used with Naive Bayes. Which of these is not an example of binning?


Possible Answers age values recoded as ‘child’ or ‘adult’ categories

geographic coordinates recoded into geographic regions (West, East, etc.)

test scores divided into four groups by percentile

income values standardized to follow a normal bell curve (answer)

Right! Transforming income values into a bell curve doesn’t create a set of categories.

35 Making binary predictions with regression

36 Introducing linear regression

Linear Regression 1

Linear Regression 1

{linear positive regression line}

Regression for binary classification * but suppose you have a binary outcome instead * something that can take only the values 1 or 0 * like donate or not donate * constructing a plot of y vs x, the points fall in two flat rows rather than being spread across the plane

Linear Regression Binary 1

Linear Regression Binary 1

Linear Regression Binary 2

Linear Regression Binary 2

37 Introducing logistic regression

Linear Regression Logistic 2

Linear Regression Logistic 2

Making predictions with logistic regression * in R, logistic regression uses the glm() function

m = glm(y ~ x1 + x2 + x3, data = my_dataset, family = "binomial")

prob = predict(m, test_dataset, type = "response")

pred = ifelse(prob > 0.50, 1, 0)
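The three steps above can be run end to end on simulated data; the names x1, y, and the 0.50 cutoff are placeholders, not anything from the donors data:

```r
# Simulate a binary outcome driven by one predictor
set.seed(1)
my_dataset <- data.frame(x1 = rnorm(200))
my_dataset$y <- rbinom(200, 1, plogis(-1 + 2 * my_dataset$x1))

# Fit the model, convert log odds to probabilities, then threshold
m    <- glm(y ~ x1, data = my_dataset, family = "binomial")
prob <- predict(m, my_dataset, type = "response")
pred <- ifelse(prob > 0.50, 1, 0)
mean(pred == my_dataset$y)   # in-sample accuracy
```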

donors_org = read_csv("C:/shobha/R/DataCamp/dataFiles/CSV-files/donors.csv")
## Parsed with column specification:
## cols(
##   donated = col_double(),
##   veteran = col_double(),
##   bad_address = col_double(),
##   age = col_double(),
##   has_children = col_double(),
##   wealth_rating = col_double(),
##   interest_veterans = col_double(),
##   interest_religion = col_double(),
##   pet_owner = col_double(),
##   catalog_shopper = col_double(),
##   recency = col_character(),
##   frequency = col_character(),
##   money = col_character()
## )
dim(donors_org)
## [1] 93462    13
names(donors_org)
##  [1] "donated"           "veteran"           "bad_address"      
##  [4] "age"               "has_children"      "wealth_rating"    
##  [7] "interest_veterans" "interest_religion" "pet_owner"        
## [10] "catalog_shopper"   "recency"           "frequency"        
## [13] "money"

38 Building simple logistic regression models

The donors dataset contains 93,462 examples of people mailed in a fundraising solicitation for paralyzed military veterans. The donated column is 1 if the person made a donation in response to the mailing and 0 otherwise. This binary outcome will be the dependent variable for the logistic regression model.

The remaining columns are features of the prospective donors that may influence their donation behavior. These are the model’s independent variables.

When building a regression model, it is often helpful to form a hypothesis about which independent variables will be predictive of the dependent variable. The bad_address column, which is set to 1 for an invalid mailing address and 0 otherwise, seems like it might reduce the chances of a donation. Similarly, one might suspect that religious interest (interest_religion) and interest in veterans affairs (interest_veterans) would be associated with greater charitable giving.

In this exercise, you will use these three factors to create a simple model of donation behavior.

{We build the glm regression model for the label donated using three independent variables/features: bad_address, interest_veterans, interest_religion. - Then we use the model to predict the probabilities of donation and save them as a new variable. - Then we calculate the mean of the actual donated label. - Then we make a binary prediction with an ifelse() condition: if the probability is greater than the average donation rate, assign a predicted value of 1, otherwise 0. - The accuracy is then the mean of the observations where the predicted and actual values of donation are the same.}

The dataset donors is available in your workspace.

Examine donors using the str() function.

Count the number of occurrences of each level of the donated variable using the table() function.

Fit a logistic regression model using the formula interface and the three independent variables described above.

Summarize the model object with summary()

donors = donors_org
glimpse(donors)
## Observations: 93,462
## Variables: 13
## $ donated           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ veteran           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ bad_address       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ age               <dbl> 60, 46, NA, 70, 78, NA, 38, NA, NA, 65, NA, ...
## $ has_children      <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ wealth_rating     <dbl> 0, 3, 1, 2, 1, 0, 2, 3, 1, 0, 1, 2, 1, 0, 2,...
## $ interest_veterans <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ interest_religion <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ pet_owner         <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ catalog_shopper   <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ recency           <chr> "CURRENT", "CURRENT", "CURRENT", "CURRENT", ...
## $ frequency         <chr> "FREQUENT", "FREQUENT", "FREQUENT", "FREQUEN...
## $ money             <chr> "MEDIUM", "HIGH", "MEDIUM", "MEDIUM", "MEDIU...
# Examine the dataset to identify potential independent variables
str(donors)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 93462 obs. of  13 variables:
##  $ donated          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ veteran          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bad_address      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ age              : num  60 46 NA 70 78 NA 38 NA NA 65 ...
##  $ has_children     : num  0 1 0 0 1 0 1 0 0 0 ...
##  $ wealth_rating    : num  0 3 1 2 1 0 2 3 1 0 ...
##  $ interest_veterans: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ interest_religion: num  0 0 0 0 1 0 0 0 0 0 ...
##  $ pet_owner        : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ catalog_shopper  : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ recency          : chr  "CURRENT" "CURRENT" "CURRENT" "CURRENT" ...
##  $ frequency        : chr  "FREQUENT" "FREQUENT" "FREQUENT" "FREQUENT" ...
##  $ money            : chr  "MEDIUM" "HIGH" "MEDIUM" "MEDIUM" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   donated = col_double(),
##   ..   veteran = col_double(),
##   ..   bad_address = col_double(),
##   ..   age = col_double(),
##   ..   has_children = col_double(),
##   ..   wealth_rating = col_double(),
##   ..   interest_veterans = col_double(),
##   ..   interest_religion = col_double(),
##   ..   pet_owner = col_double(),
##   ..   catalog_shopper = col_double(),
##   ..   recency = col_character(),
##   ..   frequency = col_character(),
##   ..   money = col_character()
##   .. )
# Explore the dependent variable
table(donors = donors$donated)
## donors
##     0     1 
## 88751  4711
table(donors = donors$donated, bad_address = donors$bad_address)
##       bad_address
## donors     0     1
##      0 87440  1311
##      1  4660    51

Of prospects with a bad address, 51 donated; of those with a valid address, 4,660 donated, which sums to the total of 4,711 donors.

table(donors = donors$donated, interest_veterans = donors$interest_veterans)
##       interest_veterans
## donors     0     1
##      0 78998  9753
##      1  4133   578
table(donors = donors$donated, interest_religion = donors$interest_religion)
##       interest_religion
## donors     0     1
##      0 80439  8312
##      1  4231   480

Of prospects with a religious interest, 480 donated; of those without, 4,231 donated, which again sums to 4,711.

# Build the donation model
donation_model <- glm(donated ~ bad_address + interest_veterans + interest_religion, data = donors, family = "binomial")

# Summarize the model results
summary(donation_model)
## 
## Call:
## glm(formula = donated ~ bad_address + interest_veterans + interest_religion, 
##     family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3480  -0.3192  -0.3192  -0.3192   2.5678  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -2.95139    0.01652 -178.664   <2e-16 ***
## bad_address       -0.30780    0.14348   -2.145   0.0319 *  
## interest_veterans  0.11009    0.04676    2.354   0.0186 *  
## interest_religion  0.06724    0.05069    1.327   0.1847    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37316  on 93458  degrees of freedom
## AIC: 37324
## 
## Number of Fisher Scoring iterations: 5

Great work! With the model built, you can now use it to make predictions!

39 Making a binary prediction

In the previous exercise, you used the glm() function to build a logistic regression model of donor behavior. As with many of R’s machine learning methods, you can apply the predict() function to the model object to forecast future behavior. By default, predict() outputs predictions in terms of log odds unless type = ‘response’ is specified. This converts the log odds to probabilities.

Because a logistic regression model estimates the probability of the outcome, it is up to you to determine the threshold at which the probability implies action. One must balance the extremes of being too cautious versus being too aggressive. For example, if you were to solicit only the people with a 99% or greater donation probability, you may miss out on many people with lower estimated probabilities that still choose to donate. This balance is particularly important to consider for severely imbalanced outcomes, such as in this dataset where donations are relatively rare.

The dataset donors and the model donation_model are already loaded in your workspace.

Use the predict() function to estimate each person’s donation probability. Use the type argument to get probabilities. Assign the predictions to a new column called donation_prob.

Find the actual probability that an average person would donate by passing the mean() function the appropriate column of the donors dataframe.

Use ifelse() to predict a donation if their predicted donation probability is greater than average. Assign the predictions to a new column called donation_pred.

Use the mean() function to calculate the model’s accuracy.

# Estimate the donation probability
donors$donation_prob <- predict(donation_model, type = "response")
head(donors$donation_prob)
## [1] 0.04967101 0.04967101 0.04967101 0.04967101 0.05294280 0.04967101
# Find the donation probability of the average prospect
mean(donors$donated)
## [1] 0.05040551
# For no donation
mean(!donors$donated)
## [1] 0.9495945
# Predict a donation if probability of donation is greater than average (0.0504)
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)
head(donors$donation_pred)
## [1] 0 0 0 0 1 0
# Calculate the model's accuracy
mean(donors$donated == donors$donation_pred)
## [1] 0.794815

Nice work! With an accuracy of nearly 80%, the model seems to be doing its job. But is it too good to be true?

The limitations of accuracy

In the previous exercise, you found that the logistic regression model made a correct prediction nearly 80% of the time. Despite this relatively high accuracy, the result is misleading due to the rarity of the outcome being predicted.

The donors dataset is available in your workspace. What would the accuracy have been if a model had simply predicted ‘no donation’ for each person?

Possible Answers 80%

85%

90%

95% (answer - see working below)

Correct! With an accuracy of only 80%, the model is actually performing WORSE than if it were to predict non-donor for every record.

# For no donation
mean(!donors$donated)
## [1] 0.9495945
# Predict 'no donation' (1) for every record, since no predicted
# probability exceeds 0.95
donors$no_donation_pred <- ifelse(donors$donation_prob > 0.95, 0, 1)
head(donors$no_donation_pred)
## [1] 1 1 1 1 1 1
# Calculate the model's accuracy
mean(!donors$donated == donors$no_donation_pred)
## [1] 0.9495945

40 Model performance tradeoffs

41 Understanding ROC curves

Linear Regression ROC 1

Linear Regression ROC 1

Linear Regression ROC 3

Linear Regression ROC 3

Linear Regression ROC 2

Linear Regression ROC 2

Linear Regression ROC 4

Linear Regression ROC 4

42 Area under the ROC curve (AUC)

Linear Regression AUC 1

Linear Regression AUC 1

43 Using AUC and ROC appropriately

Linear Regression AUC 2

Linear Regression AUC 2

44 Calculating ROC Curves and AUC

The previous exercises have demonstrated that accuracy is a very misleading
measure of model performance on imbalanced datasets. Graphing the model’s
performance better illustrates the tradeoff between a model that is overly aggressive and one that is overly passive.

In this exercise you will create a ROC curve and compute the area under the curve (AUC) to evaluate the logistic regression model of donations you built earlier.

The dataset donors with the column of predicted probabilities, donation_prob, is already loaded in your workspace.

Load the pROC package.

Create a ROC curve with roc() and the columns of actual and predicted donations.

Store the result as ROC.

Use plot() to draw the ROC object. Specify col = ‘blue’ to color the curve blue.

Compute the area under the curve with auc().

# Load the pROC package
# install.packages("pROC")
library(pROC)
## Warning: package 'pROC' was built under R version 3.5.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Create a ROC curve
ROC <- roc(donors$donated, donors$donation_prob)

# Plot the ROC curve
plot(ROC, col = "blue")

# Calculate the area under the curve (AUC)
auc(ROC)
## Area under the curve: 0.5102

Area under the curve: 0.5102

Awesome job! Based on this visualization, the model isn’t doing much better than baseline: a model doing nothing but making predictions at random.

When comparing AUCs, keep in mind that curves with the same AUC value can have very different shapes, so it is best to examine the ROC plot alongside the AUC.

45 Dummy variables, missing data, and interactions

46 Dummy coding categorical data

47 Create a gender factor

my_data$gender = factor(my_data$gender,
                    levels = c(0,1,2),
                    labels = c('Male','Female','Other'))

48 Imputing missing data

49 Interaction effects

Linear Regression Interact

Linear Regression Interact

Linear Regression Interact

Linear Regression Interact

50 Interaction of obesity and smoking

glm(disease ~ obesity * smoking,
            data = health,
            family = 'binomial')

51 Coding categorical features

Sometimes a dataset contains numeric values that represent a categorical feature.

In the donors dataset, wealth_rating uses numbers to indicate the donor’s wealth level:

0 = Unknown 1 = Low 2 = Medium 3 = High

This exercise illustrates how to prepare this type of categorical feature and then examines its impact on a logistic regression model.

The dataframe donors is loaded in your workspace.

Create a factor from the numeric wealth_rating with labels as shown above by passing the factor() function the column you want to convert, the individual levels, and the labels.

Use relevel() to change the reference category to Medium. The first argument should be your factor column.

Build a logistic regression model using the column wealth_rating to predict donated and display the result with summary().

str(donors$wealth_rating)
##  num [1:93462] 0 3 1 2 1 0 2 3 1 0 ...
length(unique(donors$wealth_rating))
## [1] 4
unique(donors$wealth_rating)
## [1] 0 3 1 2
# convert wealth rating to factor variable
donors$wealth_rating = factor(donors$wealth_rating,
                              levels = c(0,1,2,3),
                              labels = c("Unknown", "Low", "Medium", "High"))

# create reference category for dummy coding
# Use relevel() to change reference category
donors$wealth_rating <- relevel(donors$wealth_rating, ref = "Medium")

# See how our factor coding impacts the model
summary(glm(donated ~ wealth_rating, 
            family = 'binomial',
            data = donors))
## 
## Call:
## glm(formula = donated ~ wealth_rating, family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3320  -0.3243  -0.3175  -0.3175   2.4582  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.91894    0.03614 -80.772   <2e-16 ***
## wealth_ratingUnknown -0.04373    0.04243  -1.031    0.303    
## wealth_ratingLow     -0.05245    0.05332  -0.984    0.325    
## wealth_ratingHigh     0.04804    0.04768   1.008    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37323  on 93458  degrees of freedom
## AIC: 37331
## 
## Number of Fisher Scoring iterations: 5

Great job! What would the model output have looked like if this variable had been left as a numeric column?

52 Handling missing data

Some of the prospective donors have missing age data. Unfortunately, R will exclude any cases with NA values when building a regression model.

One workaround is to replace, or impute, the missing values with an estimated value. After doing so, you may also create a missing data indicator to model the possibility that cases with missing data are different in some way from those without.

The dataframe donors is loaded in your workspace.

Use summary() on donors to find the average age of prospects with non-missing data.

Use ifelse() and the test is.na(donors$age) to impute the average (rounded to 2 decimal places) for cases with missing age.

Create a binary dummy variable named missing_age indicating the presence of missing data using another ifelse() call and the same test.

summary(donors$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   48.00   62.00   61.65   75.00   98.00   22546
avg_age = mean(donors$age, na.rm = TRUE)
avg_age
## [1] 61.64787
donors$imputed_age = ifelse(is.na(donors$age), 61.65, donors$age)

summary(donors$imputed_age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   52.00   61.65   61.65   72.00   98.00
donors$missing_age = ifelse(is.na(donors$age), 1, 0)

#cross check the new age variables
donors %>%
  filter(is.na(age)) %>%
  select(age, missing_age, imputed_age) %>%
  head()
## # A tibble: 6 x 3
##     age missing_age imputed_age
##   <dbl>       <dbl>       <dbl>
## 1    NA           1        61.6
## 2    NA           1        61.6
## 3    NA           1        61.6
## 4    NA           1        61.6
## 5    NA           1        61.6
## 6    NA           1        61.6

Super! This is one way to handle missing data, but be careful! Sometimes missing data has to be dealt with using more complicated methods.

53 Understanding missing value indicators

A missing value indicator provides a reminder that, before imputation, there was a missing value present on the record.

Why is it often useful to include this indicator as a predictor in the model?

ANSWER THE QUESTION

Possible Answers A missing value may represent a unique category by itself

There may be an important difference between records with and without missing data

Whatever caused the missing value may also be related to the outcome

All of the above (answer)

Yes! Sometimes a missing value says a great deal about the record it appeared on!

54 Building a more sophisticated model

One of the best predictors of future giving is a history of recent, frequent, and large gifts. In marketing terms, this is known as R/F/M:

Recency, Frequency, Money. Donors that have given both recently and frequently may be especially likely to give again; in other words, the combined impact of recency and frequency may be greater than the sum of the separate effects.

Because these predictors together have a greater impact on the dependent variable, their joint effect must be modeled as an interaction.


56 The donors dataset has been loaded for you.

Create a logistic regression model of donated as a function of money plus the interaction of recency and frequency. Use * to add the interaction term.

Examine the model’s summary() to confirm the interaction effect was added.

Save the model’s predicted probabilities as rfm_prob. Use the predict() function, and remember to set the type argument.

Plot a ROC curve by using the function roc(). Remember, this function takes the column of outcomes and the vector of predictions.

Compute the AUC for the new model with the function auc() and compare performance to the simpler model.

# Build a recency, frequency, and money (RFM) model
rfm_model <- glm(donated ~ money + recency * frequency, family = "binomial", data = donors)

# Summarize the RFM model to see how the parameters were coded
summary(rfm_model)
## 
## Call:
## glm(formula = donated ~ money + recency * frequency, family = "binomial", 
##     data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3696  -0.3696  -0.2895  -0.2895   2.7924  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -3.01142    0.04279 -70.375   <2e-16 ***
## moneyMEDIUM                        0.36186    0.04300   8.415   <2e-16 ***
## recencyLAPSED                     -0.86677    0.41434  -2.092   0.0364 *  
## frequencyINFREQUENT               -0.50148    0.03107 -16.143   <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT  1.01787    0.51713   1.968   0.0490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 36938  on 93457  degrees of freedom
## AIC: 36948
## 
## Number of Fisher Scoring iterations: 6
# Compute predicted probabilities for the RFM model
rfm_prob <- predict(rfm_model, type = "response")

# Plot the ROC curve and find AUC for the new model
library(pROC)
ROC <- roc(donors$donated,rfm_prob)
plot(ROC, col = "red")

auc(ROC)
## Area under the curve: 0.5785

Great work! Based on the ROC curve, you’ve confirmed that past giving patterns are certainly predictive of future giving.

57 Automatic feature selection

58 Stepwise regression

(slides: Linear Regression Stepwise)

59 Stepwise regression caveats

(slide: Linear Regression Stepwise Caveats)

60 The dangers of stepwise regression

In spite of its utility for feature selection, stepwise regression is not frequently used in disciplines outside of machine learning due to some important caveats. Which of these is NOT one of these concerns?

ANSWER THE QUESTION

Possible Answers

It is not guaranteed to find the best possible model

A stepwise model’s predictions can not be trusted (answer - this is not a concern)

The stepwise regression procedure violates some statistical assumptions

It can result in a model that makes little sense in the real world

Correct! Though stepwise regression is frowned upon, it may still be useful for building predictive models in the absence of another starting place.

Building a stepwise regression model

In the absence of subject-matter expertise, stepwise regression can assist with the search for the most important predictors of the outcome of interest.

In this exercise, you will use a forward stepwise approach to add predictors to the model one-by-one until no additional benefit is seen.

The donors dataset has been loaded for you.

Use the R formula interface with glm() to specify the base model with no predictors. Set the explanatory variable equal to 1.

Use the R formula interface again with glm() to specify the model with all predictors.

Apply step() to these models to perform forward stepwise regression. Set the first argument to null_model and set direction = "forward". This might take a while (up to 10 or 15 seconds) as your computer has to fit quite a few different models to perform stepwise selection.

Create a vector of predicted probabilities using the predict() function.

Plot the ROC curve with roc() and plot() and compute the AUC of the stepwise model with auc().

# Specify a null model with no predictors
null_model <- glm(donated ~ 1, data = donors, family = "binomial")

# Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data = donors, family = "binomial")

# Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, 
                   scope = list(lower = null_model, upper = full_model), 
                   direction = "forward")
## Start:  AIC=37332.13
## donated ~ 1
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + frequency          1    28502 37122
## + money              1    28621 37241
## + has_children       1    28705 37326
## + age                1    28707 37328
## + imputed_age        1    28707 37328
## + wealth_rating      3    28704 37328
## + interest_veterans  1    28709 37330
## + donation_prob      1    28710 37330
## + donation_pred      1    28710 37330
## + catalog_shopper    1    28710 37330
## + pet_owner          1    28711 37331
## <none>                    28714 37332
## + interest_religion  1    28712 37333
## + recency            1    28713 37333
## + bad_address        1    28714 37334
## + veteran            1    28714 37334
## 
## Step:  AIC=37024.77
## donated ~ frequency
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + money              1    28441 36966
## + wealth_rating      3    28490 37019
## + has_children       1    28494 37019
## + donation_prob      1    28498 37023
## + interest_veterans  1    28498 37023
## + catalog_shopper    1    28499 37024
## + donation_pred      1    28499 37024
## + age                1    28499 37024
## + imputed_age        1    28499 37024
## + pet_owner          1    28499 37024
## <none>                    28502 37025
## + interest_religion  1    28501 37026
## + recency            1    28501 37026
## + bad_address        1    28502 37026
## + veteran            1    28502 37027
## 
## Step:  AIC=36949.71
## donated ~ frequency + money
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + wealth_rating      3    28427 36942
## + has_children       1    28432 36943
## + interest_veterans  1    28438 36948
## + donation_prob      1    28438 36949
## + catalog_shopper    1    28438 36949
## + donation_pred      1    28439 36949
## + age                1    28439 36949
## + imputed_age        1    28439 36949
## + pet_owner          1    28439 36949
## <none>                    28441 36950
## + interest_religion  1    28440 36951
## + recency            1    28441 36951
## + bad_address        1    28441 36951
## + veteran            1    28441 36952
## 
## Step:  AIC=36945.48
## donated ~ frequency + money + wealth_rating
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + has_children       1    28416 36937
## + age                1    28424 36944
## + imputed_age        1    28424 36944
## + interest_veterans  1    28424 36945
## + donation_prob      1    28424 36945
## + catalog_shopper    1    28425 36945
## + donation_pred      1    28425 36945
## <none>                    28427 36945
## + pet_owner          1    28425 36946
## + interest_religion  1    28426 36947
## + recency            1    28427 36947
## + bad_address        1    28427 36947
## + veteran            1    28427 36947
## 
## Step:  AIC=36938.4
## donated ~ frequency + money + wealth_rating + has_children
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + pet_owner          1    28413 36937
## + donation_prob      1    28413 36937
## + catalog_shopper    1    28413 36937
## + interest_veterans  1    28413 36937
## + donation_pred      1    28414 36938
## <none>                    28416 36938
## + interest_religion  1    28415 36939
## + age                1    28416 36940
## + imputed_age        1    28416 36940
## + recency            1    28416 36940
## + bad_address        1    28416 36940
## + veteran            1    28416 36940
## 
## Step:  AIC=36932.25
## donated ~ frequency + money + wealth_rating + has_children + 
##     pet_owner
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## <none>                    28413 36932
## + donation_prob      1    28411 36932
## + interest_veterans  1    28411 36932
## + catalog_shopper    1    28412 36933
## + donation_pred      1    28412 36933
## + age                1    28412 36933
## + imputed_age        1    28412 36933
## + recency            1    28413 36934
## + interest_religion  1    28413 36934
## + bad_address        1    28413 36934
## + veteran            1    28413 36934
# Estimate the stepwise donation probability
step_prob <- predict(step_model, type = "response")

# Plot the ROC of the stepwise model
library(pROC)
ROC <- roc(donors$donated, step_prob)
plot(ROC, col = "red")

auc(ROC)
## Area under the curve: 0.5849

Note: the course's expected output is "Area under the curve: 0.6006"; the value obtained here (0.5849) differs.

Fantastic work! Despite the caveats of stepwise regression, it seems to have resulted in a relatively strong model!

61 Making decisions with trees

62 A decision tree model

(slide: Decision Tree Structure)

63 Divide-and-conquer

(slides: Decision Tree Splits 1–3)

64 Building trees in R

65 building a simple rpart classification tree

library(rpart)
m = rpart(outcome ~ loan_amount + credit_score, 
          data = loans,   
          method = "class")

66 making predictions from an rpart tree

p = predict(m, testData, type = 'class')

67 Building a simple decision tree

The loans dataset contains 11,312 randomly selected people who applied for and later received loans from Lending Club, a US-based peer-to-peer lending company.

You will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.

Then, see how the tree’s predictions differ for an applicant with good credit versus one with bad credit.

loans_org = read_csv("C:/shobha/R/DataCamp/dataFiles/CSV-files/loans.csv")
## Parsed with column specification:
## cols(
##   keep = col_double(),
##   rand = col_double(),
##   default = col_double(),
##   loan_amount = col_character(),
##   emp_length = col_character(),
##   home_ownership = col_character(),
##   income = col_character(),
##   loan_purpose = col_character(),
##   debt_to_income = col_character(),
##   credit_score = col_character(),
##   recent_inquiry = col_character(),
##   delinquent = col_character(),
##   credit_accounts = col_character(),
##   bad_public_record = col_character(),
##   credit_utilization = col_character(),
##   past_bankrupt = col_character()
## )
dim(loans_org)
## [1] 39732    16
glimpse(loans_org)
## Observations: 39,732
## Variables: 16
## $ keep               <dbl> 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1...
## $ rand               <dbl> 0.13046525, 0.99815098, 0.62827558, 0.25240...
## $ default            <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1...
## $ loan_amount        <chr> "LOW", "LOW", "LOW", "MEDIUM", "LOW", "LOW"...
## $ emp_length         <chr> "10+ years", "< 2 years", "10+ years", "10+...
## $ home_ownership     <chr> "RENT", "RENT", "RENT", "RENT", "RENT", "RE...
## $ income             <chr> "LOW", "LOW", "LOW", "MEDIUM", "HIGH", "LOW...
## $ loan_purpose       <chr> "credit_card", "car", "small_business", "ot...
## $ debt_to_income     <chr> "HIGH", "LOW", "AVERAGE", "HIGH", "AVERAGE"...
## $ credit_score       <chr> "AVERAGE", "AVERAGE", "AVERAGE", "AVERAGE",...
## $ recent_inquiry     <chr> "YES", "YES", "YES", "YES", "NO", "YES", "Y...
## $ delinquent         <chr> "NEVER", "NEVER", "NEVER", "MORE THAN 2 YEA...
## $ credit_accounts    <chr> "FEW", "FEW", "FEW", "AVERAGE", "MANY", "AV...
## $ bad_public_record  <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "...
## $ credit_utilization <chr> "HIGH", "LOW", "HIGH", "LOW", "MEDIUM", "ME...
## $ past_bankrupt      <chr> "NO", "NO", "NO", "NO", "NO", "NO", "NO", "...
names(loans_org)
##  [1] "keep"               "rand"               "default"           
##  [4] "loan_amount"        "emp_length"         "home_ownership"    
##  [7] "income"             "loan_purpose"       "debt_to_income"    
## [10] "credit_score"       "recent_inquiry"     "delinquent"        
## [13] "credit_accounts"    "bad_public_record"  "credit_utilization"
## [16] "past_bankrupt"
dim(loans_org)
## [1] 39732    16
head(loans_org$outcome)
## Warning: Unknown or uninitialised column: 'outcome'.
## NULL
loans = loans_org %>% select(-keep, -rand)

dim(loans)
## [1] 39732    14
names(loans)
##  [1] "default"            "loan_amount"        "emp_length"        
##  [4] "home_ownership"     "income"             "loan_purpose"      
##  [7] "debt_to_income"     "credit_score"       "recent_inquiry"    
## [10] "delinquent"         "credit_accounts"    "bad_public_record" 
## [13] "credit_utilization" "past_bankrupt"
head(loans$default)
## [1] 0 1 0 0 0 0
loans = loans %>%
  mutate(outcome = factor(ifelse(default == 1, "repaid", "default")))
# NB: as written, this labels default == 1 as "repaid"; swap the two labels
# if "default" is meant to correspond to default == 1

head(loans$default)
## [1] 0 1 0 0 0 0
head(loans$outcome)
## [1] default repaid  default default default default
## Levels: default repaid
loans = loans %>%
  select(-default)

dim(loans)
## [1] 39732    14

The dataset loans is already in your workspace.

Load the rpart package.

Fit a decision tree model with the function rpart().

Supply the R formula that specifies outcome as a function of loan_amount and credit_score as the first argument.

Leave the control argument alone for now. (You’ll learn more about that later!)

Use predict() with the resulting loan model to predict the outcome for the good_credit applicant. Use the type argument to predict the ‘class’ of the outcome.

Do the same for the bad_credit applicant.

str(loans$bad_public_record)
##  chr [1:39732] "NO" "NO" "NO" "NO" "NO" "NO" "NO" "NO" "NO" "NO" "NO" ...
table(loans$bad_public_record)
## 
##    NO   YES 
## 37613  2119
head(loans,1)
## # A tibble: 1 x 14
##   loan_amount emp_length home_ownership income loan_purpose debt_to_income
##   <chr>       <chr>      <chr>          <chr>  <chr>        <chr>         
## 1 LOW         10+ years  RENT           LOW    credit_card  HIGH          
## # ... with 8 more variables: credit_score <chr>, recent_inquiry <chr>,
## #   delinquent <chr>, credit_accounts <chr>, bad_public_record <chr>,
## #   credit_utilization <chr>, past_bankrupt <chr>, outcome <fct>
good_credit = data.frame("loan_amount" = "LOW",
                         "emp_length" = "10+ years",
                         "home_ownership" = "MORTGAGE",
                         "income" = "HIGH",
                         "loan_purpose" = "major_purchase",
                         "debt_to_income" = "AVERAGE",
                         "credit_score" = "HIGH",
                         "recent_inquiry" = "NO",
                         "delinquent" = "NEVER",
                         "credit_accounts" = "MANY",
                         "bad_public_record" = "NO",
                         "credit_utilization" = "LOW",
                         "past_bankrupt" = "NO",
                         "outcome" = "repaid" )

good_credit
##   loan_amount emp_length home_ownership income   loan_purpose
## 1         LOW  10+ years       MORTGAGE   HIGH major_purchase
##   debt_to_income credit_score recent_inquiry delinquent credit_accounts
## 1        AVERAGE         HIGH             NO      NEVER            MANY
##   bad_public_record credit_utilization past_bankrupt outcome
## 1                NO                LOW            NO  repaid
bad_credit = data.frame("loan_amount" = "LOW",
                        "emp_length" = "6 - 9 years",
                        "home_ownership" = "RENT",
                        "income" = "HIGH",
                        "loan_purpose" = "car",
                        "debt_to_income" = "LOW",
                        "credit_score" = "LOW",
                        "recent_inquiry" = "YES",
                        "delinquent" = "NEVER",
                        "credit_accounts" = "FEW",
                        "bad_public_record" = "NO",
                        "credit_utilization" = "HIGH",
                        "past_bankrupt" = "NO",
                        "outcome" = "repaid" )

bad_credit
##   loan_amount  emp_length home_ownership income loan_purpose
## 1         LOW 6 - 9 years           RENT   HIGH          car
##   debt_to_income credit_score recent_inquiry delinquent credit_accounts
## 1            LOW          LOW            YES      NEVER             FEW
##   bad_public_record credit_utilization past_bankrupt outcome
## 1                NO               HIGH            NO  repaid
# Load the rpart package
library(rpart)

# Build a lending model predicting loan outcome versus loan amount and credit score
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart.control(cp = 0))

# Make a prediction for someone with good credit
predict(loan_model, good_credit, type = "class")
## [1] default
## Levels: default repaid
# Make a prediction for someone with bad credit
predict(loan_model, bad_credit, type = "class")
## [1] default
## Levels: default repaid

Great job! Growing a decision tree is certainly faster than growing a real tree!

68 Visualizing classification trees

Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.

The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions.

The model loan_model that you fit in the last exercise is in your workspace.

Type loan_model to see a text representation of the classification tree.

Load the rpart.plot package.

Apply the rpart.plot() function to the loan model to visualize the tree.

See how changing other plotting parameters impacts the visualization by running the supplied command.

# Examine the loan_model object
loan_model
## n= 39732 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 39732 5654 default (0.8576966 0.1423034) *
"
first build model using good_credit and bad_credit
"
## [1] "\nfirst build model using good_credit and bad_credit\n"
# Load the rpart.plot package
#install.packages("rpart.plot")
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.5.3
# Plot the loan_model with default settings
rpart.plot(loan_model)

# Plot the loan_model with customized settings
rpart.plot(loan_model, 
           type = 3, 
           box.palette = c("red", "green"), 
           fallen.leaves = TRUE)

Awesome! What do you think of the fancy visualization?

69 Growing larger classification trees

70 Choosing where to split

(slides: Decision Tree Split Option A and Option B)

71 Axis-parallel splits

(slides: Decision Tree Axis-Parallel Splits 1–2; Decision Tree Bigger Tree)

72 Evaluating model performance

(slide: Decision Tree Test Set)

73 Why do some branches split?

A classification tree grows using a divide-and-conquer process. Each time the tree grows larger, it splits groups of data into smaller subgroups, creating new branches in the tree.

Given the following groups to divide-and-conquer, which one would the algorithm prioritize to split first?

ANSWER THE QUESTION

Possible Answers The group with the largest number of examples.

The group creating branches that improve the model’s prediction accuracy.

The group it can split to create the greatest improvement in subgroup homogeneity. (answer)

The group that has not been split already.

Correct! Divide-and-conquer always looks to create the split resulting in the greatest improvement to purity.
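The "improvement in subgroup homogeneity" can be quantified with an impurity measure such as the Gini index. A minimal sketch (the `gini()` helper and the toy labels are illustrative, not course code):

```r
# Gini impurity: 1 - sum(p_k^2); 0 = perfectly pure node, higher = more mixed
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

before <- c("repaid", "repaid", "default", "default")  # 50/50 mix
left   <- c("repaid", "repaid")                        # pure subgroup
right  <- c("default", "default")                      # pure subgroup

gini(before)  # 0.5

# weighted impurity after the split: a perfect split drops it to 0
(length(left) * gini(left) + length(right) * gini(right)) / length(before)  # 0
```

The algorithm evaluates candidate splits this way and greedily picks the one with the largest impurity reduction.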

74 Creating random test datasets

Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.

As depicted in the following image, you can use 75% of the observations for training and 25% for testing the model.

(image: Decision Tree Test Set)

The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.

Use the resulting vector of row IDs to subset the loans into training and testing datasets.

The dataset loans is in your workspace.

Apply the nrow() function to determine how many observations are in the loans dataset, and the number needed for a 75% sample.

Use the sample() function to create an integer vector of row IDs for the 75% sample. The first argument of sample() should be the number of rows in the data set, and the second is the number of rows you need in your training set.

Subset the loans data using the row IDs to create the training dataset. Save this as loans_train.

Subset loans again, but this time select all the rows that are not in sample_rows. Save this as loans_test.

# Determine the number of rows for training
nrow(loans)
## [1] 39732
#[1] 39732

nrow(loans)*0.75
## [1] 29799
#[1] 29799

# Create a random sample of row IDs
sample_rows <- sample(nrow(loans), nrow(loans)*0.75)
head(sample_rows)
## [1] 12187 35294 22670 11288 29740 18788
#[1] 32740  8511 37884  3580 34854 39161

length(sample_rows)
## [1] 29799
#[1] 29799

# Create the training dataset
loans_train <- loans[sample_rows,]

# Create the test dataset
loans_test <- loans[- sample_rows,]

Amazing work! Creating a test set is an easy way to check your model’s performance.

75 Building and evaluating a larger tree

Previously, you created a simple decision tree that used the applicant’s credit score and requested loan amount to predict the loan outcome.

Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.

Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.

The rpart package is loaded into the workspace and the loans_train and loans_test datasets have been created.

Use rpart() to build a loan model using the training dataset and all of the available predictors. Again, leave the control argument alone.

Applying the predict() function to the testing dataset, create a vector of predicted outcomes. Don’t forget the type argument.

Create a table() to compare the predicted values to the actual outcome values.

Compute the accuracy of the predictions using the mean() function.

# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., 
                    data = loans_train, 
                    method = "class", 
                    control = rpart.control(cp = 0))

# Make predictions on the test dataset
loans_test$pred <- predict(loan_model,loans_test, type = 'class')
"
        
          default repaid
  default    8330    251
  repaid     1277     75

"
## [1] "\n        \n          default repaid\n  default    8330    251\n  repaid     1277     75\n\n"
# Examine the confusion matrix
table(loans_test$outcome, loans_test$pred)
##          
##           default repaid
##   default    8302    207
##   repaid     1356     68
# Compute the accuracy on the test dataset
mean(loans_test$outcome == loans_test$pred)
## [1] 0.8426457
#[1] 0.8461693

Awesome! How did adding more predictors change the model’s performance?

76 Conducting a fair performance evaluation

Holding out test data reduces the amount of data available for growing the decision tree. In spite of this, it is very important to evaluate decision trees on data they have not seen before.

Which of these is NOT true about the evaluation of decision tree performance?

ANSWER THE QUESTION

Possible Answers Decision trees sometimes overfit the training data.

The model’s accuracy is unaffected by the rarity of the outcome. (answer - not true: the rarity of the outcome affects the model’s performance)

Performance on the training dataset can overestimate performance on future data.

Creating a test dataset simulates the model’s performance on unseen data.

Right! Rare events cause problems for many machine learning approaches.
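The rarity point is easy to demonstrate: when one class is rare, a model that never predicts the rare class still scores a high accuracy. A toy illustration (the 10% default rate and the numbers are made up, not from the course data):

```r
# 1,000 loans where only 10% default; a "model" that always predicts
# "repaid" looks impressive by accuracy alone
outcome <- c(rep("default", 100), rep("repaid", 900))
pred    <- rep("repaid", 1000)

mean(pred == outcome)  # 0.9 -- 90% accuracy while catching zero defaults
table(outcome, pred)
```

This is why accuracy on a rare outcome should be read alongside the confusion matrix or an AUC.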

77 Tending to classification trees

78 Pre-pruning

79 Post-pruning

(slides: Decision Tree Post Prune 1–2)

80 Pre- and post-pruning with R

81 pre-pruning with rpart

How to build an rpart model and prune the tree.

library(rpart)
prune_control = rpart.control(maxdepth = 30, minsplit = 20)

m = rpart(repaid ~ credit_score + request_amt,
          data = loans,
          method = 'class',
          control = prune_control)
# post-pruning with rpart
m = rpart(repaid ~ credit_score + request_amt,
          data = loans,
          method = 'class')

plotcp(m)

m_pruned = prune(m, cp = 0.20)

82 Preventing overgrown trees

The tree grown on the full set of applicant data grew to be extremely large and extremely complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.

Using the pre-pruning methods for early stopping, you can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.

The rpart package is loaded into the workspace.

Add a maxdepth parameter to the rpart.control() object to set the maximum tree depth to six. Leave the parameter cp = 0. Pass the result of rpart.control() as the control parameter in your rpart() call.

See how the test set accuracy of the simpler model compares to the original accuracy of 58.3%. - First create a vector of predictions using the predict() function. - Compare the predictions to the actual outcomes and use mean() to calculate the accuracy.

Add a minsplit parameter to the rpart.control() object to require 500 observations to split. Again, leave cp = 0.

Again compare the accuracy of the simpler tree to the original.

# Grow a tree with maxdepth of 6
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", 
                    control = rpart.control(cp = 0, maxdepth = 6))

# Compute the accuracy of the simpler tree
loans_test$pred <- predict(loan_model, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.8559348
# Grow a tree with minsplit of 500
loan_model2 <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, minsplit = 500))

# Compute the accuracy of the simpler tree
loans_test$pred2 <- predict(loan_model2, loans_test, type = "class")
mean(loans_test$pred2 == loans_test$outcome)
## [1] 0.8566395

Nice work! It may seem surprising, but creating a simpler decision tree may actually result in greater performance on the test dataset.

83 Creating a nicely pruned tree

Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later.

By using post-pruning, you can intentionally grow a large and complex tree, then prune it to be smaller and more efficient later on.

In this exercise, you will have the opportunity to construct a visualization of the tree’s performance versus complexity, and use this information to prune the tree to an appropriate level.

The rpart package is loaded into the workspace, along with loans_test and loans_train.

Use all of the applicant variables and no pre-pruning to create an overly complex tree.

Make sure to set cp = 0 in rpart.control() to prevent pre-pruning.

Create a complexity plot by using plotcp() on the model.

Based on the complexity plot, prune the tree to a complexity of 0.0014 using the prune() function with the tree and the complexity parameter.

Compare the accuracy of the pruned tree to the original accuracy of 58.3%. To calculate the accuracy use the predict() and mean() functions.

# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Examine the complexity plot
plotcp(loan_model)

# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)

# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$outcome == loans_test$pred)
## [1] 0.8566395

Great job! As with pre-pruning, creating a simpler tree actually improved the performance of the tree on the test dataset.

84 Why do trees benefit from pruning?

Classification trees can grow indefinitely, until they are told to stop or run out of data to divide-and-conquer.

Just like trees in nature, classification trees that grow overly large can require pruning to reduce the excess growth. However, this generally results in a tree that classifies fewer training examples correctly.

Why, then, are pre-pruning and post-pruning almost always used?

ANSWER THE QUESTION

Possible Answers

Simpler trees are easier to interpret

Simpler trees using early stopping are faster to train

Simpler trees may perform better on the testing data

All of the above (answer)

Yes! There are many benefits to creating carefully pruned decision trees!
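As a reminder of what pre-pruning looks like in practice, here is a minimal sketch using `rpart.control()` parameters. The built-in `iris` data and the parameter values are illustrative stand-ins, not the course's loan data:

```r
library(rpart)

# Pre-pruning: stop the tree early via control parameters.
# maxdepth caps the depth of the tree; minsplit is the minimum number of
# observations a node must contain before a split is attempted.
ctrl <- rpart.control(maxdepth = 3, minsplit = 20, cp = 0.01)

# Grow a tree that is shallow by construction
model <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)
```

Post-pruning, by contrast, grows the full tree first and trims it afterwards with `prune()`, as in the exercise above.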

Seeing the forest from the trees

85 Understanding random forest

Decision Tree Forest

86 Making decisions as an ensemble

Decision Tree Ensemble 1

Decision Tree Ensemble 2

87 Random forests in R

88 Building a simple random forest

library(randomForest)

# Build a random forest of 500 trees; at each split, a random subset of
# sqrt(p) of the p predictors is considered
m <- randomForest(repaid ~ credit_score + request_amt, data = loans,
                  ntree = 500,     # number of trees in the forest
                  mtry = sqrt(p))  # number of predictors (p) tried per split

# Making predictions from a random forest
pred <- predict(m, test_data)

89 Understanding random forests

Groups of classification trees can be combined into an ensemble that generates a single prediction by allowing the trees to ‘vote’ on the outcome.

Why might someone think that this could result in more accurate predictions than a single tree?

ANSWER THE QUESTION

Possible Answers

Each tree in the forest is larger and more complex than a typical single tree.

Every tree in a random forest uses the complete set of predictors.

The diversity among the trees may lead it to discover more subtle patterns. (answer)

The random forest is not affected by noisy data.

Yes! The teamwork-based approach of the random forest may help it find important trends a single tree may miss.
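The 'voting' idea can be sketched in a few lines of base R (the individual tree predictions here are made up for illustration):

```r
# Suppose five trees in the forest each predict a loan outcome
votes <- c("repaid", "default", "repaid", "repaid", "default")

# The ensemble's prediction is the majority class among the votes
majority <- names(which.max(table(votes)))
majority
## [1] "repaid"
```

This is exactly what `predict()` on a classification `randomForest` does internally: each tree votes, and the most common class wins.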

90 Building a random forest model

Although a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly tuned tree.

Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.

Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.

# Load the randomForest package
library(randomForest)

# Build a random forest model
loan_model <- randomForest(outcome ~ ., data = loans_train)

The code above produces the error below:

Error in randomForest.default(m, y, …) : NA/NaN/Inf in foreign function call (arg 1)

The character (chr) predictors need to be converted to factors to resolve this error. Let's try with two predictors: loan_amount and credit_score.

# Convert chr values to factor for two predictors
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
table(loans_train$loan_amount)
## 
##   HIGH    LOW MEDIUM 
##   8361   7378  14060
str(loans_train$loan_amount)
##  chr [1:29799] "LOW" "LOW" "LOW" "LOW" "HIGH" "MEDIUM" "HIGH" "MEDIUM" ...
loans_train$loan_amount <- as.factor(loans_train$loan_amount)
table(loans_train$loan_amount)
## 
##   HIGH    LOW MEDIUM 
##   8361   7378  14060
str(loans_train$loan_amount)
##  Factor w/ 3 levels "HIGH","LOW","MEDIUM": 2 2 2 2 1 3 1 3 1 2 ...
table(loans_train$credit_score)
## 
## AVERAGE    HIGH     LOW 
##   20283    5936    3580
str(loans_train$credit_score)
##  chr [1:29799] "AVERAGE" "AVERAGE" "AVERAGE" "LOW" "HIGH" "AVERAGE" ...
loans_train$credit_score <- as.factor(loans_train$credit_score)
table(loans_train$credit_score)
## 
## AVERAGE    HIGH     LOW 
##   20283    5936    3580
str(loans_train$credit_score)
##  Factor w/ 3 levels "AVERAGE","HIGH",..: 1 1 1 3 2 1 1 3 1 2 ...
# Build a random forest using the two factor predictors
loan_model <- randomForest(outcome ~ loan_amount + credit_score, 
                           data = loans_train,
                           ntree = 500)


loan_model
## 
## Call:
##  randomForest(formula = outcome ~ loan_amount + credit_score,      data = loans_train, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 14.2%
## Confusion matrix:
##         default repaid class.error
## default   25569      0           0
## repaid     4230      0           1

Make the prediction for test data

# First convert the predictors to factors
test_predictors <- loans_test %>% select(loan_amount, credit_score)
test_predictors$loan_amount <- as.factor(test_predictors$loan_amount)
test_predictors$credit_score <- as.factor(test_predictors$credit_score)

# Predict using random forest model
loans_test$pred <- predict(loan_model, test_predictors)

# Compute the accuracy of the random forest
mean(loans_test$outcome == loans_test$pred)
## [1] 0.8566395

Repeat by adding more predictors. To do: write a function that converts character variables to factors based on their unique levels/categories.
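A minimal sketch of that to-do item (the function name `chr_to_factor` is my own; it converts every character column of a data frame to a factor):

```r
# Convert all character columns of a data frame to factors
chr_to_factor <- function(df) {
  chr_cols <- sapply(df, is.character)
  df[chr_cols] <- lapply(df[chr_cols], as.factor)
  df
}

# Quick check on a toy data frame
toy <- data.frame(grade = c("HIGH", "LOW"), amt = c(1, 2),
                  stringsAsFactors = FALSE)
toy <- chr_to_factor(toy)
str(toy)
```

With a helper like this, the whole training set could be prepared in one step, e.g. `loans_train <- chr_to_factor(loans_train)`, before calling `randomForest()`.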

Wow! Great job! Now you’re really a classification pro! Classification is only one of the problems you’ll have to tackle as a data scientist. Check out some other machine learning courses to learn more about supervised and unsupervised learning.